[57] ABSTRACT

A computer system for analyzing nucleic acid sequences is 
provided. The computer system is used to calculate
probabilities for determining unknown bases by analyzing the
fluorescence intensities of hybridized nucleic acid probes on 
biological chips. Additionally, information from multiple 
experiments is utilized to improve the accuracy of calling 
unknown bases. 
COMPUTER-AIDED PROBABILITY BASE
CALLING FOR ARRAYS OF NUCLEIC ACID 
PROBES ON CHIPS 

This is a continuation application of prior application
Ser. No. 08/528,656 filed on Sep. 14, 1995, now U.S. Pat. 
No. 5,733,729, the disclosure of which is incorporated 
herein by reference. 

GOVERNMENT RIGHTS NOTICE 

Portions of the material in this specification arose under 
the cooperative agreement 70NANB5H1031 between 
Affymetrix, Inc. and the Department of Commerce through 
the National Institute of Standards and Technology. 

COPYRIGHT NOTICE

A portion of the disclosure of this patent document 
contains material which is subject to copyright protection. 
The copyright owner has no objection to the xerographic
reproduction by anyone of the patent document or the patent 
disclosure in exactly the form it appears in the Patent and 
Trademark Office patent file or records, but otherwise 
reserves all copyright rights whatsoever. 

SOFTWARE APPENDIX 

A Software Appendix comprising twenty-one (21) sheets
is included herewith. 

BACKGROUND OF THE INVENTION 

The present invention relates to the field of computer 
systems. More specifically, the present invention relates to
computer systems for evaluating and comparing biological 
sequences. 

Devices and computer systems for forming and using 
arrays of materials on a substrate are known. For example, 
PCT application WO92/10588, incorporated herein by
reference for all purposes, describes techniques for sequencing
or sequence checking nucleic acids and other materials.
Arrays for performing these operations may be formed
according to the methods of, for example, the pioneering
techniques disclosed in U.S. Pat. No. 5,143,854 and
U.S. patent application Ser. No. 08/249,188, now U.S. Pat.
No. 5,571,639, both incorporated herein by reference for all
purposes.

According to one aspect of the techniques described 
therein, an array of nucleic acid probes is fabricated at
known locations on a chip or substrate. A fluorescently
labeled nucleic acid is then brought into contact with the
chip and a scanner generates an image file (also called a cell
file) indicating the locations where the labeled nucleic acids
bound to the chip. Based upon the image file and identities
of the probes at specific locations, it becomes possible to
extract information such as the monomer sequence of DNA
or RNA. Such systems have been used to form, for example,
arrays of DNA that may be used to study and detect
mutations relevant to cystic fibrosis, the P53 gene (relevant
to certain cancers), HIV, and other genetic characteristics.

Innovative computer-aided techniques for base calling are 
disclosed in U.S. patent application Ser. No. 08/327,525, 
which is incorporated by reference for all purposes. 
However, improved computer systems and methods are still
needed to evaluate, analyze, and process the vast amount of
information now used and made available by these
pioneering technologies.

SUMMARY OF THE INVENTION

An improved computer-aided system for calling unknown 
bases in sample nucleic acid sequences from multiple
nucleic acid probe intensities is disclosed. The present
invention is able to call bases with extremely high accuracy 
(up to 98.5%). At the same time, confidence information 
may be provided that indicates the likelihood that the base 
has been called correctly. The methods of the present 
invention are robust and uniformly optimal regardless of the 
experimental conditions. 

According to one aspect of the invention, a computer 
system is used to identify an unknown base in a sample 
nucleic acid sequence by the steps of: inputting a plurality of 
hybridization probe intensities, each of the probe intensities 
corresponding to a nucleic acid probe; for each of the 
plurality of probe intensities, determining a probability that 
the corresponding nucleic acid probe best hybridizes with 
the sample nucleic acid sequence; and calling the unknown 
base according to the nucleic acid probe with the highest 
associated probability. 
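
The final step of this aspect can be sketched in a few lines of Python. This is only an illustration, assuming the per-probe probabilities have already been computed; how they are derived from the intensities is not reproduced here. The function name `call_base` and the example numbers are hypothetical.

```python
def call_base(probabilities):
    """Return (base, probability) for the probe most likely to hybridize best.

    `probabilities` maps each candidate base to the probability that the
    corresponding probe best hybridizes with the sample sequence. The values
    are assumed to be supplied by an earlier step (not shown).
    """
    if not probabilities:
        raise ValueError("no probe probabilities supplied")
    base = max(probabilities, key=probabilities.get)
    return base, probabilities[base]


# Hypothetical probabilities for one interrogation position.
example = {"A": 0.05, "C": 0.85, "G": 0.07, "T": 0.03}
print(call_base(example))
```

Returning the probability alongside the base keeps the confidence information available to later steps that combine several experiments.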

According to another aspect of the invention, an unknown 
base in a sample nucleic acid sequence is called by a base 
call with the highest probability of correctly calling the 
unknown base. The unknown base in the sample nucleic acid 
sequence is identified by the steps of: inputting multiple base 
calls for the unknown base, each of the base calls having an 
associated probability which represents a confidence that the 
unknown base is called correctly; selecting a base call that 
has a highest associated probability; and calling the 
unknown base according to the selected base call. The 
multiple base calls are typically produced from multiple 
experiments. The multiple experiments may be performed 
on the same chip utilizing different parameters (e.g., nucleic 
acid probe length). 
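
A minimal sketch of this selection, assuming each experiment has already produced one base call and its associated probability, might look as follows. The function name `max_probability_call` and the sample values are illustrative assumptions, not taken from the Software Appendix.

```python
def max_probability_call(calls):
    """Pick the base call with the highest associated probability.

    `calls` is a list of (base, probability) pairs, one per experiment.
    If several calls tie, the earliest one in the list is returned.
    """
    if not calls:
        raise ValueError("at least one base call is required")
    return max(calls, key=lambda call: call[1])


# Hypothetical calls from three experiments (e.g., different probe lengths).
calls = [("A", 0.80), ("G", 0.95), ("A", 0.60)]
print(max_probability_call(calls))
```

Note that this approach trusts the single most confident experiment, even when the other experiments agree with one another on a different base.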

According to yet another aspect of the invention, an 
unknown base in a sample nucleic acid sequence is called 
according to multiple base calls that collectively have the 
highest probability of correctly calling the unknown base. 
The unknown base in the sample nucleic acid sequence is 
identified by the steps of: inputting multiple probabilities for 
each possible base for the unknown base, each of the 
probabilities representing a probability that the unknown 
base is an associated base; producing a product of probabilities
for each possible base, each product being associated
with a possible base; and calling the unknown base according
to a base associated with a highest product. The multiple
base calls are typically produced from multiple experiments. 
The multiple experiments may be performed on the same 
chip utilizing different parameters (e.g., nucleic acid probe 
length). 
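
The steps of this aspect can be sketched as below, assuming each experiment supplies a probability for every possible base. The names `BASES` and `product_of_probabilities_call` and the example numbers are hypothetical; any normalization of the products is omitted.

```python
from math import prod  # available in Python 3.8 and later

BASES = "ACGT"


def product_of_probabilities_call(experiments):
    """Call the base whose per-experiment probabilities have the largest product.

    `experiments` is a list of dicts, each mapping every base in BASES to the
    probability (from one experiment) that the unknown base is that base.
    Returns the called base and the dict of products.
    """
    if not experiments:
        raise ValueError("at least one experiment is required")
    products = {b: prod(e[b] for e in experiments) for b in BASES}
    best = max(products, key=products.get)
    return best, products


# Hypothetical data: one experiment strongly favors A, two moderately favor C.
experiments = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},
    {"A": 0.10, "C": 0.60, "G": 0.20, "T": 0.10},
    {"A": 0.10, "C": 0.60, "G": 0.20, "T": 0.10},
]
best, products = product_of_probabilities_call(experiments)
print(best)
```

With these made-up numbers the product favors C, whereas selecting the single most confident call would give A, which illustrates how the two combination rules can differ.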

According to another aspect of the invention, both strands 
of a DNA molecule are analyzed to increase the accuracy of 
identifying an unknown base in a sample nucleic acid 
sequence by the steps of: inputting a first base call for the 
unknown base, the first base call determined from a first 
nucleic acid probe that is equivalent to a portion of the 
sample nucleic acid sequence including the unknown base; 
inputting a second base call for the unknown base, the 
second base call determined from a second nucleic acid 
probe that is complementary to a portion of the sample 
nucleic acid sequence including the unknown base; selecting 
one of the first or second nucleic acid probes that has a base 
at an interrogation position which has a high probability of 
producing correct base calls; and calling the unknown base 
according to the selected one of the first or second nucleic 
acid probes. 
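
The selection between the two strands can be sketched as follows. This is a rough illustration only: the `reliability` table, its values, and the function name are hypothetical, and this passage does not state how the likelihood of a correct call is assigned to each interrogation base.

```python
def wild_type_preference_call(first, second, reliability):
    """Choose between two base calls by the probe's interrogation base.

    `first` and `second` are (base_call, interrogation_base) pairs for the
    probe equivalent to, and the probe complementary to, the sample portion.
    `reliability` maps an interrogation base to an assumed probability that
    probes with that base produce correct calls. Ties favor `first`.
    """
    chosen = max((first, second), key=lambda c: reliability[c[1]])
    return chosen[0]


# Hypothetical reliabilities; real values would come from experimental data.
reliability = {"A": 0.90, "C": 0.95, "G": 0.80, "T": 0.85}
first = ("A", "T")   # call from the first probe, interrogation base T
second = ("G", "A")  # call from the second probe, interrogation base A
print(wild_type_preference_call(first, second, reliability))
```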

A further understanding of the nature and advantages of 
the inventions herein may be realized by reference to the 
remaining portions of the specification and the attached 
drawings. 
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of a computer system used
to execute the software of the present invention;

FIG. 2 shows a system block diagram of a typical computer
system used to execute the software of the present
invention;

FIG. 3 illustrates an overall system for forming and
analyzing arrays of biological materials such as DNA or
RNA;

FIG. 4 is an illustration of the software for the overall
system;

FIG. 5 illustrates the global layout of a chip formed in the
overall system;

FIG. 6 illustrates conceptually the binding of probes on
chips;

FIG. 7 illustrates probes arranged in lanes on a chip;

FIG. 8 illustrates a hybridization pattern of a target on a
chip with a reference sequence as in FIG. 7;

FIG. 9 illustrates the high level flow of the probability
base calling method;

FIG. 10 illustrates the flow of the maximum probability
method;

FIG. 11 illustrates the flow of the product of probabilities
method; and

FIG. 12 illustrates the flow of the wild-type base preference
method.

Contents

I. General
II. Probability Base Calling Method
III. Maximum Probability Method
IV. Product of Probabilities Method
V. Wild-Type Base Preference Method
VI. Software Appendix

I. General

In the description that follows, the present invention will
be described in reference to a Sun Workstation in a UNIX
environment. The present invention, however, is not limited
to any particular hardware or operating system environment.
Instead, those skilled in the art will find that the systems and
methods of the present invention may be advantageously
applied to a variety of systems, including IBM personal
computers running MS-DOS or Microsoft Windows.
Therefore, the following description of specific systems is
for purposes of illustration and not limitation.

FIG. 1 illustrates an example of a computer system used
to execute the software of the present invention. FIG. 1
shows a computer system 1 which includes a monitor 3,
screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11
may have one or more buttons such as mouse buttons 13.
Cabinet 7 houses a floppy disk drive 14 and a hard drive (not
shown) that may be utilized to store and retrieve software
programs incorporating the present invention. Although a
floppy disk 15 is shown as the removable media, other
removable tangible media including CD-ROM and tape may
be utilized. Cabinet 7 also houses familiar computer
components (not shown) such as a processor, memory, and the
like.

FIG. 2 shows a system block diagram of computer system
1 used to execute the software of the present invention. As
in FIG. 1, computer system 1 includes monitor 3 and
keyboard 9. Computer system 1 further includes subsystems
such as a central processor 52, system memory 54, I/O
controller 56, display adapter 58, serial port 62, disk 64,
network interface 66, and speaker 68. Other computer
systems suitable for use with the present invention may
include additional or fewer subsystems. For example,
another computer system could include more than one
processor 52 (i.e., a multi-processor system) or memory
cache.

Arrows such as 70 represent the system bus architecture
of computer system 1. However, these arrows are illustrative
of any interconnection scheme serving to link the
subsystems. For example, speaker 68 could be connected to the
other subsystems through a port or have an internal direct
connection to central processor 52. Computer system 1
shown in FIG. 2 is but an example of a computer system
suitable for use with the present invention. Other
configurations of subsystems suitable for use with the present
invention will be readily apparent to one of ordinary skill in
the art.

The VLSIPS technology provides methods of making
very large arrays of oligonucleotide probes on very small
chips. See U.S. Pat. No. 5,143,854 and PCT patent
publication Nos. WO 90/15070 and 92/10092, each of which is
incorporated by reference for all purposes. The
oligonucleotide probes on the "DNA chip" are used to detect
complementary nucleic acid sequences in a sample nucleic acid of
interest (the "target" nucleic acid).

The present invention provides methods of analyzing
hybridization intensity files for a chip containing hybridized
nucleic acid probes. In a representative embodiment, the
files represent fluorescence data from a biological array, but
the files may also represent other data such as radioactive
intensity data. Therefore, the present invention is not limited
to analyzing fluorescent measurements of hybridizations but
may be readily utilized to analyze other measurements of
hybridization.

For purposes of illustration, the present invention is
described as being part of a computer system that designs a
chip mask, synthesizes the probes on the chip, labels the
nucleic acids, and scans the hybridized nucleic acid probes.
Such a system is fully described in U.S. patent application
Ser. No. 08/249,188, now U.S. Pat. No. 5,571,639, which has
been incorporated by reference for all purposes. However,
the present invention may be used separately from the
overall system for analyzing data generated by such
systems.

FIG. 3 illustrates a computerized system for forming and
analyzing arrays of biological materials such as RNA or
DNA. A computer 100 is used to design arrays of biological
polymers such as RNA or DNA. The computer 100 may be,
for example, an appropriately programmed Sun Workstation
or personal computer or workstation, such as an IBM PC
equivalent, including appropriate memory and a CPU as
shown in FIGS. 1 and 2. The computer system 100 obtains
inputs from a user regarding characteristics of a gene of
interest, and other inputs regarding the desired features of
the array. Optionally, the computer system may obtain
information regarding a specific genetic sequence of interest
from an external or internal database 102 such as GenBank.
The output of the computer system 100 is a set of chip design
computer files 104 in the form of, for example, a switch
matrix, as described in PCT application WO 92/10092, and
other associated computer files.

The chip design files are provided to a system 106 that
designs the lithographic masks used in the fabrication of